Data Position and Profiling in Domain-Independent Warehouse Cleaning

Authors

Christie I. Ezeife

Ajumobi Udechukwu

Abstract

A major problem that arises from integrating different databases is the existence of duplicates. Data cleaning is the process of identifying two or more records within a database that represent the same real-world object (duplicates), so that a unique representation can be adopted for each object. Existing data cleaning techniques rely heavily on full or partial domain knowledge. This paper proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The paper also proposes a technique for field weighting through data profiling which, when used with the positional algorithm, achieves domain-independent cleaning at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than existing algorithms.
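The abstract does not give the details of the positional algorithm or of the profiling technique, so the following is only a rough sketch of the general idea of combining attribute-level similarity with profiling-based field weights. Everything in it is an assumption made for illustration: the uniqueness-ratio weighting, the function names, the sample records, and the use of Python's `difflib.SequenceMatcher` as a stand-in for the paper's positional matcher.

```python
from difflib import SequenceMatcher


def field_weights(records):
    """Profile the data: weight each field by its share of distinct values.

    Assumption for illustration only: fields whose values are mostly
    distinct are treated as more discriminating than near-constant ones.
    """
    fields = list(records[0].keys())
    uniqueness = {
        f: len({r[f].strip().lower() for r in records}) / len(records)
        for f in fields
    }
    total = sum(uniqueness.values())
    # Normalize so that the weights sum to 1.
    return {f: u / total for f, u in uniqueness.items()}


def field_similarity(a, b):
    """Attribute-level similarity in [0, 1].

    Stand-in matcher (difflib), not the paper's positional algorithm.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a, rec_b, weights):
    """Record-level score: weighted sum of the attribute-level scores."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())


# Hypothetical sample data.
records = [
    {"name": "John A. Smith", "city": "Windsor"},
    {"name": "Jon Smith", "city": "Windsor"},
    {"name": "Mary Jones", "city": "Windsor"},
]
weights = field_weights(records)
likely_duplicate = record_similarity(records[0], records[1], weights)
likely_distinct = record_similarity(records[0], records[2], weights)
```

In this sketch the `city` field carries little weight because every record shares the same value, so the record-level score is driven mainly by `name`. A pair would be flagged as a duplicate when its score exceeds a chosen threshold; the threshold is left out here because the abstract does not state one.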


Related resources

Modeling the Data Warehouse Refreshment Process as a Workflow Application

This article is a position paper on the nature of the data warehouse refreshment which is often defined as a view maintenance problem or as a loading process. We will show that the refreshment process is more complex than the view maintenance problem, and different from the loading process. We conceptually define the refreshment process as a workflow whose activities depend on the available pro...


Eliminating Fuzzy Duplicates in Data Warehouses

The duplicate elimination problem of detecting multiple tuples that describe the same real-world entity is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches ...


Identification of Categorical Registration Data of Domain Names in Data Warehouse Construction Task

This work addresses the construction of a data warehouse for processing a large volume of domain name registration data. Data cleaning is applied to increase the effectiveness of decision support. In warehouses, data cleaning detects and removes errors and discrepancies in the data in order to improve its quality. For this purpose, fuzzy record comparison algori...


On Data Cleaning In Building XML Data Warehouses

One of the most important aspects of building an XML data warehouse is the data cleaning and integration process. This paper presents a detailed methodology for cleaning and integrating data, especially useful in general situations where documents from different sources are involved. Both situations, whereby the XML documents have an associated XML Schema or are just independent XML documents, are con...


A Unified Framework and Sequential Data Cleaning Approach for a Data Warehouse

Data cleaning is the process of identifying and removing errors in the data warehouse. It is very important in the data mining process. Most organizations need quality data, so the quality of the data in the warehouse must be improved before mining. The framework available for data cleaning offers the fundamental services for data cleaning s...



Publication date: 2003
